Proof of biased behavior of Normalized Mutual Information

The Normalized Mutual Information (NMI) metric is widely used to evaluate clustering and community detection algorithms. This study examines how the NMI behaves as the number of communities varies and uncovers a significant drawback: the metric becomes pronouncedly biased as the number of communities increases. While previous studies have noted this biased behavior, they have neither provided a formal proof nor addressed its cause, leaving a gap in the existing literature. In this study, we fill this gap by employing a mathematical approach to formally demonstrate why NMI exhibits biased behavior, thereby establishing its unsuitability as a metric for evaluating clustering and community detection algorithms. Crucially, our study shows that other entropy-based metrics that employ logarithmic functions are vulnerable to similar bias.


Related works
Community detection (CD) algorithms aim to identify groups of nodes characterized by dense interconnections compared to the rest of the network 39,40. Girvan and Newman 39 introduced the modularity metric to evaluate the accuracy of communities detected by their algorithm, sparking the development of numerous algorithms based on this metric. However, Fortunato 26 highlighted a resolution limit in the modularity metric, indicating its inability to detect small-sized communities. Cai et al. 31 further demonstrated that maximizing modularity is an NP-hard problem and that a random network without any communities can achieve a high modularity (Q) value. Chen, Nguyen, and Szymanski 41 underscored the inconsistencies of the modularity metric, noting its tendency to favor either small or large communities in different scenarios. They proposed a new measure, modularity density, which combines modularity with split penalty and community density to circumvent the dual problems inherent in modularity.
The NMI was first considered a precise metric by Danon et al. 36, who reported its sensitivity to errors in the community detection procedure. They define $Z_{out}$ as the average number of links a node has to members of any other community and show that the NMI tends to zero as $Z_{out}$ increases. Subsequent research has addressed the limitations of the NMI measure, with Romano et al. 42 emphasizing the role of the number of clusters in the evaluation metrics. Amelio and Pizzuti 30 argued that the NMI is not fair, as solutions with a high number of clusters receive disproportionately high NMI values. Zhang 34 demonstrated that the NMI is significantly affected by systematic errors due to finite network sizes and proposed the relative normalized mutual information (rNMI). Lai and Nardini 32 introduced the corrected normalized mutual information (cNMI) to address the reverse finite size problem of the rNMI. Liu, Cheng, and Zhang 33 highlighted the drawbacks of NMI and its improved versions, such as rNMI and cNMI, noting that these measures often overlook the importance of small communities. Rossetti, Pappalardo, and Rinzivillo 43 introduced community precision and community recall to evaluate CD algorithms, addressing the high computational complexity of the NMI. Arab and Hasheminezhad 44 also reported scalability problems with the NMI in large-scale data.
Other researchers have proposed alternative measures for evaluating community detection and clustering algorithms. Meilă 45 introduced the variation of information (VI) metric, an entropy-based measure that operates based on mutual information. Wagner and Wagner 46 categorized measures based on counting pairs, set overlaps, and mutual information and concluded that information theoretical measures outperform counting pairs and set overlaps measures. Santos and Embrechts 47 utilized the adjusted Rand index (ARI) for cluster validation and feature selection. Yang and Leskovec 48 compared 13 measures for evaluating community detection algorithms, categorizing them into four groups and concluding that conductance and triad-participation-ratio have the best performance in identifying communities. Saltz, Prat-Pérez, and Dominguez-Sal 49 introduced a new metric for the CD problem, weighted community clustering (WCC), which operates based on the distribution of triangles in the graph.

Normalized Mutual Information (NMI)
The NMI serves as a metric for assessing the performance of community detection algorithms. The NMI compares two partitions of the same nodes into clusters or communities, yielding a value that ranges from 0 to 1. A higher value indicates a greater degree of similarity between the two partitions. As an external metric, the NMI necessitates the availability of class labels for computations, implying that the ground truth is required when employing this metric. The NMI is calculated according to Eq. (1), where I(A, B) is the mutual information and H is the entropy, as shown in Eqs. (2) and (3).
The expansion of Eq. (1) with respect to (2) and (3) is Eq. (4). Suppose there are two networks denoted as Net 1 and Net 2, each consisting of sets of vertices (V) and edges (E). Net 1 consists of R communities denoted as $A = \{A_1, A_2, \ldots, A_R\}$, while Net 2 consists of S communities denoted as $B = \{B_1, B_2, \ldots, B_S\}$. $C_{ij}$ denotes the number of nodes that clusters (communities) $A_i$ and $B_j$ share. If A = B, then NMI(A, B) = 1; if A and B are completely different, then NMI(A, B) = 0.
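As a reading aid, the block below writes out the standard form of the NMI introduced by Danon et al. in the notation of this section. It is an assumption that Eqs. (1) to (4) of this paper coincide with it; the row and column sums $C_{i.} = \sum_j C_{ij}$ and $C_{.j} = \sum_i C_{ij}$, the total $N = \sum_{i,j} C_{ij}$, and the convention $0 \log 0 = 0$ are used throughout.

```latex
\begin{align}
\mathrm{NMI}(A,B) &= \frac{2\, I(A,B)}{H(A) + H(B)} \\
I(A,B) &= \sum_{i=1}^{R}\sum_{j=1}^{S} \frac{C_{ij}}{N}
          \log\!\left(\frac{C_{ij}\, N}{C_{i.}\, C_{.j}}\right) \\
H(A) &= -\sum_{i=1}^{R} \frac{C_{i.}}{N}\log\!\left(\frac{C_{i.}}{N}\right),
\qquad
H(B) = -\sum_{j=1}^{S} \frac{C_{.j}}{N}\log\!\left(\frac{C_{.j}}{N}\right) \\
\mathrm{NMI}(A,B) &=
\frac{-2\sum_{i=1}^{R}\sum_{j=1}^{S} C_{ij}
      \log\!\left(\dfrac{C_{ij}\, N}{C_{i.}\, C_{.j}}\right)}
     {\sum_{i=1}^{R} C_{i.}\log\!\left(\dfrac{C_{i.}}{N}\right)
      + \sum_{j=1}^{S} C_{.j}\log\!\left(\dfrac{C_{.j}}{N}\right)}
\end{align}
```

The last line follows from substituting the second and third lines into the first and cancelling the common factor 1/N. The base of the logarithm cancels as well, so any base gives the same NMI value.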
In addition to the NMI, some other measures can be used to evaluate the accuracy of CD and clustering algorithms. Table 1 lists well-known measures in this domain. We list these measures here to highlight the differences between the NMI and other measures in practice.
Essentially, to compute the measures listed in Table 1, a contingency table (CT) is employed. This table is created based on the joint members between communities detected by a certain algorithm and the ground truth. The contingency table used for computing the NMI is presented in Table 2.
Table 3 presents important notations.
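The contingency table described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `contingency_table` and the toy labels are hypothetical, and each node is assumed to belong to exactly one community in each labeling.

```python
from collections import Counter


def contingency_table(truth, detected):
    """Build the contingency table (CT) between two labelings of the same nodes.

    truth[k] and detected[k] are the community labels of node k.
    Rows follow the sorted ground-truth labels, columns the sorted detected labels.
    """
    if len(truth) != len(detected):
        raise ValueError("both labelings must cover the same nodes")
    rows = sorted(set(truth))
    cols = sorted(set(detected))
    # (truth label, detected label) -> number of shared nodes; missing pairs count as 0
    counts = Counter(zip(truth, detected))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Hypothetical toy labeling: 6 nodes, 3 ground-truth and 2 detected communities.
truth = [0, 0, 1, 1, 2, 2]
detected = [0, 0, 0, 1, 1, 1]
rows, cols, table = contingency_table(truth, detected)
print(table)  # [[2, 0], [1, 1], [0, 2]]
```

Row sums of the table give the ground-truth community sizes and column sums give the detected community sizes, which are the quantities written $C_{i.}$ and $C_{.j}$ later in the paper.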

Problem statement
In this section, we present the main drawback of the NMI. We illustrate this problem through an example.
Example 1. Suppose we have 40 members and eight gold standard communities (ground truth). The number of communities in the ground truth is 8, and it should remain constant across all the experiments. In Table 4, the second column shows the number of communities detected by a specific algorithm, while columns 3-8 display the values for each measure. The last column represents the number of ground truth communities that share common members with the detected communities (ω). For example, if a certain algorithm detects two communities, the ω value is 8, since all 8 ground truth communities share common members with the detected communities.
Table 1. Well-known metrics for evaluating community detection algorithms. Columns: Metric, Formula, Symbol. Recoverable rows: Normalized Mutual Information (NMI), Hubert, Jaccard.
In this study, we assume the best-case scenario in which the detected communities of a certain algorithm lead to the highest NMI value. Thus, we propose:
Axiom 1: The highest NMI value is obtained if and only if the detected communities have the highest possible number of common members with the ground truth and if the highest value of ω is maintained.
Now, as shown in Table 4 and Fig. 1, everything appears to be fine when R is less than S. However, the situation changes when R increases and surpasses S. At first glance, it may not seem that there is a strong argument to support the claim that the values of the metrics decline as the number of members per community decreases. However, upon closer examination, it becomes evident that the decrease in the NMI value is less steep compared to the other metrics. This raises the question of what is amiss when all algorithms are compared using a single metric such as the NMI.
The problem arises when the number of detected communities is equal to N, indicating that each community includes only one member, which is the worst-case scenario. Surprisingly, even in this unfavorable situation, the NMI value still indicates a high level of efficiency. On the other hand, when R equals one or two, the NMI value returned is very low. For instance, Table 4 demonstrates that when R is 2, the NMI is 0.5, whereas in the worst-case scenario with R being 40, the NMI is 0.72. This finding implies that if Alg1 and Alg2 return 2 and 40 communities respectively, the NMI suggests that the accuracies of Alg1 and Alg2 are 0.5 and 0.72 respectively. Similarly, when R is 4, the NMI is 0.8, whereas when R is 27 the NMI is also 0.8. In the first example, the NMI favors the algorithm that detects more communities, and in the second it cannot distinguish between the two algorithms. However, in reality, a certain algorithm that detects 4 communities may be better than another algorithm that detects 27 communities with respect to the number of ground truth communities.
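The NMI values quoted above for R = 2, R = 4, and R = 40 can be reproduced with a short sketch. Two assumptions are made here that the text does not state explicitly: the eight ground truth communities each hold 5 of the 40 members, and the NMI is the standard Danon et al. form written as −2K / (L + M). The function name `nmi` and the label lists are hypothetical.

```python
import math
from collections import Counter


def nmi(truth, detected):
    """NMI = -2K / (L + M), using natural logarithms (the base cancels out)."""
    n = len(truth)
    cells = Counter(zip(truth, detected))      # non-zero CT cells C_ij
    truth_sizes = Counter(truth)               # row sums C_i.
    detected_sizes = Counter(detected)         # column sums C_.j
    k = sum(c * math.log(c * n / (truth_sizes[t] * detected_sizes[d]))
            for (t, d), c in cells.items())
    l = sum(s * math.log(s / n) for s in truth_sizes.values())
    m = sum(s * math.log(s / n) for s in detected_sizes.values())
    if l + m == 0:
        return 1.0  # both labelings put every node in one community
    return -2.0 * k / (l + m)


# Assumption: the 8 ground-truth communities of Example 1 have 5 members each.
truth = [node // 5 for node in range(40)]

# Assumption: detected communities merge or split whole ground-truth blocks (Axiom 1).
two = [node // 20 for node in range(40)]     # R = 2
four = [node // 10 for node in range(40)]    # R = 4
singletons = list(range(40))                 # R = 40 = N

for name, detected in [("R=2", two), ("R=4", four), ("R=40", singletons)]:
    print(name, round(nmi(truth, detected), 2))  # 0.5, 0.8, 0.72
```

Under these assumptions the sketch returns the same three values as Table 4, which makes the bias concrete: one community per node scores higher than a coarse but sensible merge of the ground truth into two communities.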

Proof of biased behavior of NMI
As discussed in the previous section, when the number of detected communities (R) increases and exceeds the number of communities in the ground truth (S), the NMI exhibits a biased behavior. Therefore, in this section, we aim to analyze the effect of the number of communities on this measure. First, we decompose the NMI formula, allowing us to examine how NMI values change from the minimum to the maximum. Subsequently, we explain why NMI changes are more pronounced when R < S than when R > S. Before we begin our discussion, it should be noted that Axiom 1 is assumed to hold for all lemmas and their proofs.

Decomposition
For simplicity, we restate Eq. (4) (NMI) here.
www.nature.com/scientificreports/
We decompose the above equation to examine the behavior and performance of each component. The NMI consists of three main components: the common members ($C_{ij}$), the sum of common members for each community detected by a particular algorithm ($C_{.j}$), and the sum of common members for each community in the ground truth ($C_{i.}$). Here, S represents the number of communities in the ground truth, and R represents the number of communities detected by a specific algorithm. We denote the set of common members as Z. The letters S and R are also used for two sets of community sizes. $S = \{s_1, s_2, \ldots, s_S\}$ is a set representing the number of members in each community of the ground truth, where $s_i$ denotes the number of members in community i. $R = \{r_1, r_2, \ldots, r_R\}$ is a set representing the number of members in each community detected by a certain algorithm, where $r_j$ represents the number of members in community j. It is important to note that there exists an inverse relationship between each pair of elements in S, and likewise between each pair of elements in R: because the sizes in each set sum to N (Lemma 1), one element can grow only if another shrinks.
Suppose that K denotes the double sum in the numerator of Eq. (4), and that L and M denote the two sums in its denominator, taken over the ground truth communities and the detected communities respectively, so that NMI = −2K / (L + M). The L and M values are always negative, because each logarithm in them is taken of a fraction between 0 and 1. In addition, K is positive; therefore −2K is also negative and the NMI value is positive.
Lemma 1 $N = \sum_{j=1}^{R} r_j = \sum_{i=1}^{S} s_i$
Proof In simple terms, a certain algorithm detects communities with N members, where these N members are distributed among R communities. The same holds for the ground truth, which consists of S communities.

From minimum to maximum
Lemma 2 If R = 1 and S > 1, then NMI = 0.
Proof If R = 1, then a certain algorithm places all members in one community, whereas S > 1 means that there is more than one community in the ground truth. It follows that $C_{ij} = C_{i.}$ and $N = C_{.j}$, so every logarithm in K has argument 1 and therefore K = 0. Similarly, when S = 1 and R > 1, the NMI is zero.
Lemma 3 The NMI reaches its maximum value of 1 when R = S and the CT is a square and diagonal matrix.
Proof When R = S, only the diagonal elements of the CT are non-zero. To have a maximum NMI value of 1, each community in the ground truth must match a community detected by a certain algorithm, first in the number of members and second in the members themselves. It should be noted that this does not imply that all communities must have the same number of members. Therefore, when R equals S, a square matrix is formed in which each pair of matched communities contains the same members. As a result, members are not distributed across two communities, and the contingency table is a diagonal matrix. Since the CT is diagonal, only one cell in each row or column is non-zero, so each non-zero cell equals both its row sum and its column sum, and consequently L = M.
Furthermore, this can be demonstrated through contradiction by considering the case when L and M are not equal, that is, R ≠ S. If R > S, it implies that some members are distributed across different communities. This distribution decreases the number of common members, consequently increasing the absolute value of the denominator in the fraction. Consequently, the NMI decreases, as shown in Eqs. (10) and (11). Now, if S > R, the absolute value of the numerator decreases. This occurs because the logarithmic reduction has a smaller slope compared to the effect of the coefficient.
In the following example, the roles of K and M are crucial in the NMI formula, while L is constant. Let us consider two algorithms, Alg 1 and Alg 2, along with their respective communities.
Example 2 Suppose the ground truth consists of 4 communities: $\{a_1, a_2\}, \{a_3, a_4\}, \{a_5, a_6\}, \{a_7, a_8\}$. Alg 1 returns two communities: $\{a_1, a_2, a_3, a_4\}, \{a_5, a_6, a_7, a_8\}$. Alg 2 returns six communities: $\{a_1, a_2\}, \{a_3, a_4\}, \{a_5\}, \{a_6\}, \{a_7\}, \{a_8\}$. The values of K, L, M, and NMI for Alg 1 and Alg 2 demonstrate the roles of K and M in the NMI formula and how they contribute to the NMI value of each algorithm.
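The quantities of Example 2 can be recomputed with the sketch below. It assumes that K is the double sum in the numerator of the standard NMI formula, that L is the ground truth sum and M the detected sum in the denominator, and that natural logarithms are used; these definitions are inferred from the surrounding text, and the function name `klm` is hypothetical.

```python
import math
from collections import Counter


def klm(truth, detected):
    """Return (K, L, M) such that NMI = -2K / (L + M); natural logarithms."""
    n = len(truth)
    cells = Counter(zip(truth, detected))      # non-zero CT cells C_ij
    truth_sizes = Counter(truth)               # C_i. (ground truth)
    detected_sizes = Counter(detected)         # C_.j (detected)
    k = sum(c * math.log(c * n / (truth_sizes[t] * detected_sizes[d]))
            for (t, d), c in cells.items())
    l = sum(s * math.log(s / n) for s in truth_sizes.values())
    m = sum(s * math.log(s / n) for s in detected_sizes.values())
    return k, l, m


# Community label of a1..a8 in the ground truth and in the two algorithms.
truth = [0, 0, 1, 1, 2, 2, 3, 3]   # {a1,a2}, {a3,a4}, {a5,a6}, {a7,a8}
alg1 = [0, 0, 0, 0, 1, 1, 1, 1]    # {a1..a4}, {a5..a8}
alg2 = [0, 0, 1, 1, 2, 3, 4, 5]    # {a1,a2}, {a3,a4}, {a5}, {a6}, {a7}, {a8}

for name, detected in (("Alg 1", alg1), ("Alg 2", alg2)):
    k, l, m = klm(truth, detected)
    print(name, round(k, 3), round(l, 3), round(m, 3), round(-2 * k / (l + m), 3))
```

Under these definitions L is the same for both algorithms (about −11.09), while K and M change: Alg 1 gives an NMI of 2/3 and Alg 2 gives 8/9, so the algorithm that over-splits the ground truth receives the higher score.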

R = S
Now, let us discuss the scenarios where the number of detected communities is greater than or less than the number of communities in the ground truth.

R > S
It is clear that |Z| = R × S. For this state:
Lemma 4 If R > S, then the maximum value of NMI is obtained if and only if the cardinality of the set of common members (Z) is equal to R × (R − 1), and if the number of pairwise common members ($C_{RS}$) that are not equal to zero is equal to or greater than the cardinality of R.
First, it is important to note that the contingency table cannot contain fewer than |R| non-zero cells. This is because a detected community without any common members would violate the pigeonhole principle in mathematics. According to this principle, if n items are placed into m containers, where n > m, then at least one container must contain more than one item.
In this scenario, each detected community must have at least one common member, resulting in at least |R| non-zero elements in Z. This can be illustrated with an example.

Example 3
Let us suppose that the ground truth has 2 communities and a certain algorithm detects 3 communities. Table 5 displays the number of common members (Z). The number of nonzero elements in Z is 3, which is equal to or greater than |R|.

Proof (lemma 4)
The greater the difference between R and S, the greater the number of communities detected. This results in fewer common members between the communities in R and S, subsequently leading to a decrease in NMI.

R < S
For this state:
Lemma 5 If R < S, then the maximum value of NMI is obtained if and only if the cardinality of the set of common members (Z) is equal to S × (S − 1), and if the number of pairwise common members ($C_{RS}$) that are not equal to zero is equal to the cardinality of S.
Lemma 5 can be proven in the same way as Lemma 4.

Lemma 6
When the difference between R and S increases, the slope of the change in the NMI values is greater in cases where R < S than in cases where R > S.
Proof By applying Lemma 4 and Lemma 5, we can demonstrate that if there is a difference between R and S, the optimal NMI value is obtained when the difference between R and S is 1.In other words, as the difference between R and S increases, the NMI decreases.
Equation (6) is equal to function (12), where we consider the NMI as f(x, y, z), K = g(x, y, z), and M = h(z). For all cases, N log N is a constant; therefore h(z) is completely dependent on P, such that P < N log N. When S > R, increasing R results in a decrease in P, causing |h(z)| to increase. This might suggest that the NMI decreases. However, g(x, y, z) increases, leading to an increase in the NMI. Let us examine g(x, y, z) to further understand this. The derivation proceeds through Eqs. (13) to (19): Eqs. (17) and (18) give g(x, y, z), which is then combined with Eqs. (13), (15), and (19). Now, let us consider Eq. (21) for two possible scenarios.

R < S (the number of detected communities is lower than that of the ground truth)
In this scenario, let us suppose that Algorithm 1 returns $N_1$ communities in one run and $N_2$ communities in another run, where $N_1 < N_2$. According to Axiom 1, as the number of communities increases, the possibility of member distribution also increases, which leads to a reduction in the P value according to (22), (14), and (13). A decrease in P results in a larger numerator in Eq. (21). This is because it increases the difference with $C_2$, where $C_2$ is a constant value for all the scenarios. By (14) and (22), $P_1 > P_2$. To further illustrate this, let us consider the following example:
Example 4 Suppose the ground truth consists of 4 communities, Algorithm 1 returns 2 communities, Algorithm 2 returns 3 communities (in both cases, R < S), and N = 8.
Therefore, the NMIs of different algorithms depend on P and T. Let us consider Alg1 and Alg2 with $P_1$, $T_1$, $P_2$, and $T_2$ respectively. Suppose Alg1 has a lower number of communities than Alg2. In this case, it can be proven that Eq. (23) holds.
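Example 4 can be checked numerically with the sketch below. Several things here are assumptions rather than statements from the paper: P is taken to be $\sum_j C_{.j} \log C_{.j}$ over the detected communities, T is taken to be $\sum_{i,j} C_{ij} \log C_{ij}$ over the non-zero cells of the contingency table, the four ground truth communities are assumed to have 2 members each, and the partitions returned by the two algorithms are hypothetical choices that merge whole ground truth communities in line with Axiom 1.

```python
import math
from collections import Counter


def p_value(detected):
    """Assumed definition: P = sum over detected communities of size * log(size)."""
    return sum(s * math.log(s) for s in Counter(detected).values())


def t_value(truth, detected):
    """Assumed definition: T = sum over non-zero CT cells of C_ij * log(C_ij)."""
    return sum(c * math.log(c) for c in Counter(zip(truth, detected)).values())


# Hypothetical instance of Example 4: N = 8, ground truth with 4 communities of size 2.
truth = [0, 0, 1, 1, 2, 2, 3, 3]
alg1 = [0, 0, 0, 0, 1, 1, 1, 1]   # R = 2: two merged pairs of ground-truth communities
alg2 = [0, 0, 0, 0, 1, 1, 2, 2]   # R = 3: one merged pair, two communities kept intact

p1, p2 = p_value(alg1), p_value(alg2)
t1, t2 = t_value(truth, alg1), t_value(truth, alg2)
print(round(p1, 3), round(p2, 3))  # P shrinks as R grows toward S
print(round(t1, 3), round(t2, 3))  # T stays the same in both runs
```

With these choices P drops from 16 log 2 to 12 log 2 when R goes from 2 to 3, while T equals 8 log 2 in both runs, which agrees with the statements that $P_1 > P_2$ and that T is constant when R < S.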

R > S (the number of detected communities is greater than that of the ground truth)
To analyze the scenario where R > S, we start by incrementing R by one unit. This implies that, after the case where R = S, in the next state, a certain algorithm returns R + 1 communities. As a reminder, S is constant. In this case, the members of one community are distributed across two communities to maintain Axiom 1. By further increasing R, this process is repeated until the new community consists of only one member. The distribution of shared members from one community to new communities results in a lower value of P compared to previous states according to (22) and (14). Additionally, because P = T in this case (R > S), the numerator remains constant for all scenarios.
The key issue here is that increasing R results in a non-linear (approximately logarithmic) decrease in P. The conditions under which P attains its maximum with respect to R follow from (22) and lead to Eq. (28). According to Eq. (28), we can conclude that when R < S, the slope of the change curve is greater than when R > S. This results in a larger P value when R < S compared to states where R > S. Additionally, it can be proven that the T value when R < S is constant. Therefore, NMI when R < S changes with respect to P, and a greater change in P when R < S leads to a greater change in NMI.
Figure 2 illustrates a nonlinear (approximately logarithmic) decrease in the P value as the number of communities (R) increases.
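The flattening decrease of P can be illustrated with a small sketch. It assumes P = $\sum_j r_j \log r_j$ over the detected community sizes and, purely for illustration, that the N = 40 members of Example 1 are split as evenly as possible into R communities; the balanced split and the helper names are hypothetical and ignore the nesting that Axiom 1 would impose.

```python
import math


def balanced_sizes(n, r):
    """Split n members into r communities whose sizes differ by at most one."""
    base, extra = divmod(n, r)
    return [base + 1] * extra + [base] * (r - extra)


def p_value(sizes):
    """Assumed definition: P = sum of size * log(size) over detected communities."""
    return sum(s * math.log(s) for s in sizes)


n, s = 40, 8  # Example 1: 40 members, 8 ground-truth communities
p = {r: p_value(balanced_sizes(n, r)) for r in range(1, n + 1)}

# Average drop in P per additional community, below and above R = S.
slope_below = (p[1] - p[s]) / (s - 1)
slope_above = (p[s] - p[n]) / (n - s)
print(round(slope_below, 2), round(slope_above, 2))
```

In this setting P falls by roughly 11.9 per added community between R = 1 and R = 8, but only by about 2.0 per added community between R = 8 and R = 40, which matches the shape described for Fig. 2.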

Lemma 7
When R < S and Axiom 1 holds, T is a constant.
As the number of communities (R) becomes larger than the number of communities in the ground truth (S), the changes in P decrease. In this case, the value of T is equal to P, resulting in a constant numerator for the scenario when R > S. However, due to a small change in the denominator, the magnitude of change in NMI is smaller when R > S than when R < S.
Lemma 8 When R > S and Axiom 1 holds, T equals P.
Proof When R = S , the contingency table becomes a scalar matrix.
Each unit increase in R when R > S leads to the relations in Eqs. (24) to (27). Given these, for the case when R > S, the T value is equal to P, resulting in a constant numerator for this scenario.
Overall, when we summarize the NMI formula in Eq. (21), the logarithm function remains the key factor. One of the main characteristics of the logarithm function is that its slope flattens as its argument grows, which is reflected in its decreasing derivative. This observation forms the main intuition behind Eq. (28), which explains the biased behavior of NMI. Furthermore, this phenomenon can be generalized to other entropy-based metrics that utilize the logarithm function in their formulas.
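The argument of this section can be condensed into one worked derivation. It is a sketch under stated assumptions, not a restatement of the paper's Eqs. (12) to (28): K, L, and M are taken from the standard NMI formula with i indexing ground truth communities and j indexing detected communities, P and T are assumed to be the sums defined in the first line, and the symbols G and W are hypothetical names introduced here for the two constants.

```latex
\begin{align*}
P &= \sum_{j} C_{.j}\log C_{.j}, \quad
T = \sum_{i,j} C_{ij}\log C_{ij}, \quad
G = \sum_{i} C_{i.}\log C_{i.}, \quad
W = N\log N \\
K &= T - P - G + W, \qquad L = G - W, \qquad M = P - W \\
\mathrm{NMI} &= \frac{-2K}{L + M} = \frac{2\,(T - P - G + W)}{2W - G - P} \\[4pt]
R < S \ (\text{merged ground truth communities, } T = G):\quad
\mathrm{NMI} &= \frac{2\,(W - P)}{2W - G - P}, \qquad
\frac{\partial\,\mathrm{NMI}}{\partial P} = -\frac{2\,(W - G)}{(2W - G - P)^{2}} \\
R > S \ (\text{split ground truth communities, } T = P):\quad
\mathrm{NMI} &= \frac{2\,(W - G)}{2W - G - P}, \qquad
\frac{\partial\,\mathrm{NMI}}{\partial P} = \frac{2\,(W - G)}{(2W - G - P)^{2}}
\end{align*}
```

When R < S the detected communities are large, so P ≥ G and the denominator $2W - G - P$ is at most $2(W - G)$; when R > S, P ≤ G and the denominator is at least $2(W - G)$. The NMI is therefore more sensitive to a change in P on the R < S side, and P itself changes faster there because $x \log x$ is steeper for large community sizes. At R = N every detected community has one member, so P = 0 and the NMI settles at $2(W - G)/(2W - G)$, which is strictly positive; for the 40-member example with eight ground truth communities of 5 members this limit is about 0.72.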

Case study
In this section, we provide a case study to demonstrate why and how NMI returns biased results in practice.
To do this, we utilized the email-Eu-core network dataset 50, which was generated using email data from a large European research institution. This dataset is provided by Jure Leskovec from Stanford University 51, and according to 50 (p. 12), "we have anonymized information about all incoming and outgoing email of the research institution". It consists of 1005 nodes (N = 1005) and 25,571 edges. To compare the NMI values and reveal the biased behavior in different scenarios according to the R value, seven well-known community detection algorithms were deployed: multi-level, Louvain, leading eigenvector, infomap, walktrap, edge betweenness (GN), and Leiden. These algorithms are implemented as functions of the 'igraph' package of the R programming language: multilevel.community, cluster_louvain, cluster_leading_eigen, cluster_infomap, cluster_walktrap, cluster_edge_betweenness, and cluster_leiden. These functions take as input a network comprising nodes and edges and return each node name along with its corresponding community number. The diverse range of R values obtained by these algorithms helps us clearly show the limitations of NMI. Additionally, a plot of the cumulative distribution function (CDF) of the percentage of common members according to the CT is illustrated for each community detection algorithm (see Fig. 3). Each point in this plot is labeled with a number representing the number of common members. For instance, if a point is labeled 1 and the corresponding value on the y-axis is 97, it means that 97% of the cells in the CT have 1 or fewer common members. The label '0' and the corresponding CDF value on the y-axis represent the percentage of cells in the CT with a zero value, which indicates a lack of common members between the communities detected by a certain algorithm and the communities in the ground truth. The x-axis shows the numbers of common members observed in the cells of the CT. For example, in the top plot of Fig. 3, the CDF of common members in the contingency table for the GN algorithm reveals the occurrence of the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 17, 19, and 53. These values are the numbers of common members between the communities detected by the GN algorithm and the ground truth, and the CDF essentially shows their frequency in the CT. It is evident that lower values are more commonly observed in the cells of the CT, with the exception of the Leiden algorithm. As we will discuss later, the Leiden algorithm returns a CT in which all cells are filled with the value '1'.
Table 8 shows the NMI values obtained by applying the above-mentioned algorithms to the email-Eu-core network. The abnormal NMI value, which alone can highlight the drawback of the NMI measure, is obtained by the Leiden algorithm, with a value of 0.65. Leiden returns 1005 communities, meaning that each community consists of only one node. In practice, this is a worst-case scenario, as considering each node as a separate community in relation to the ground truth (42 communities) is a noticeable fault.
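The CDF over contingency table cells described above can be sketched as follows. This is an illustrative reading of the plot rather than the authors' plotting code: the function name `cell_value_cdf` and the toy table are hypothetical, and every cell of the table, including zero cells, is counted.

```python
from collections import Counter


def cell_value_cdf(table):
    """Return [(value, percent of CT cells with at most that many common members)]."""
    cells = [c for row in table for c in row]
    freq = Counter(cells)
    total = len(cells)
    cdf, running = [], 0
    for value in sorted(freq):
        running += freq[value]
        cdf.append((value, 100.0 * running / total))
    return cdf


# Hypothetical 3 x 2 contingency table (ground truth rows, detected columns).
toy_table = [[2, 0], [1, 1], [0, 2]]
for value, percent in cell_value_cdf(toy_table):
    print(value, round(percent, 1))
```

Each printed pair corresponds to one labeled point in a plot like Fig. 3: the label is the number of common members and the percentage is the share of cells holding that many members or fewer.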
Furthermore, as shown in Fig. 3, 100% of the cells in CT for the Leiden algorithm have a value of 1, indicating that the communities detected by Leiden have only one common member with the ground truth communities.It is evident that the NMI value of 0.65 is a significantly biased value, whereas the ARI is zero for this algorithm, providing a more realistic measure of common members with the ground truth.The problem arises when an analyst deploys only the NMI without considering information about R, S, or common members, which is a common approach in scientific experiments in community detection studies.In such cases, the analyst may interpret Leiden or infomap as the best algorithm, while Leiden is actually the worst.
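The contrast between NMI and ARI for an all-singletons result can be reproduced on synthetic labels. The sketch below does not use the email-Eu-core data: the ground truth is a hypothetical assignment of 1005 nodes to 42 nearly equal communities, the NMI is the standard Danon et al. form, and the ARI is the usual pair-counting formula of Hubert and Arabie. The function names are hypothetical.

```python
import math
from collections import Counter


def nmi(truth, detected):
    """Standard NMI written as -2K / (L + M); natural logarithms."""
    n = len(truth)
    cells = Counter(zip(truth, detected))
    rows, cols = Counter(truth), Counter(detected)
    k = sum(c * math.log(c * n / (rows[t] * cols[d])) for (t, d), c in cells.items())
    l = sum(s * math.log(s / n) for s in rows.values())
    m = sum(s * math.log(s / n) for s in cols.values())
    return 1.0 if l + m == 0 else -2.0 * k / (l + m)


def ari(truth, detected):
    """Adjusted Rand index from pair counts over the contingency table."""
    n = len(truth)
    sum_cells = sum(math.comb(c, 2) for c in Counter(zip(truth, detected)).values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(detected).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: the index cannot vary
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical stand-in for the case study: 1005 nodes, 42 ground-truth communities.
truth = [node % 42 for node in range(1005)]
singletons = list(range(1005))  # one community per node, as returned by Leiden here

print(round(nmi(truth, singletons), 2), ari(truth, singletons))
```

On this synthetic ground truth the singleton partition scores an NMI of about 0.70 and an ARI of exactly 0. The paper's 0.65 comes from the real, unevenly sized communities, but the qualitative gap between the two metrics is the same.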
Now, let us consider the second-highest value of R in Table 8, which is the result of the edge betweenness (GN) algorithm.Its R value is 731, which is significantly greater than S, leading to the distribution of members in numerous communities.Consequently, it is less likely that the communities detected by GN align with the

Conclusion
In this study, we conducted an in-depth analysis of one of the most recognized measures employed in the evaluation of clustering and community detection algorithms. Utilizing a mathematical approach, we demonstrated the inherent bias of this measure (NMI), a bias that becomes particularly pronounced when the number of communities detected by a given algorithm surpasses the number of communities present in the ground truth. Our findings underscore the significant impact that the number of detected communities can have on each evaluation metric, an effect that is especially notable in logarithmic entropy-based metrics such as NMI. This observation is critical because it highlights the potential for skewed results and misinterpretations when using this metric in different applications.
The findings of this study can be generalized and applied in various contexts that utilize community detection evaluation metrics, ranging from friendship networks to critical applications such as identifying certain cancer diseases in protein-protein interaction (PPI) networks. Our study has highlighted and formally proven that the NMI is a biased metric for evaluating the results of community detection algorithms in any application. For instance, an algorithm that fails to detect the protein structure in PPI networks may be considered successful based on the NMI, potentially leading to the misdiagnosis of cancer. Therefore, our findings underscore the need for careful consideration of the characteristics and limitations of NMI, particularly in scenarios where the number of detected communities is high. Accordingly, any field of study intending to base decisions on community detection algorithms should exercise caution in selecting the appropriate metric for evaluating these algorithms. This applies across various domains, such as marketing, where accurately targeting communities is crucial, or in information diffusion methods, where identifying dense communities is vital for activating more nodes.
However, the primary contribution of this study lies in revealing the root cause of this biased behavior, originating from the logarithmic function and its corresponding derivative.This insight holds significant value for future studies focused on designing new and equitable metrics within this domain.Understanding the mathematical behavior of this logarithmic metric can substantially aid in the creation of more precise evaluation metrics.

Figure 1. The response of different metrics based on varying numbers of communities.

Figure 2. Nonlinear (approximately logarithmic) decreases in the P value.

Figure 3. Cumulative Distribution Function (CDF) of common members based on CT.
We analyze all possible states based on the number of communities. Table 4 and Fig. 1 present the results of NMI values for different states compared to those of the ARI, PWF, Fowlkes-Mallows, Hubert statistics, and Jaccard.

Table 4. The results of evaluation metrics based on all possible numbers of communities for a sample network with 40 nodes.

Table 5. Contingency table of the sample community detection algorithm.
Additionally, if the number of non-zero members in Z is greater than R, it indicates that there are fewer common members between S and R, further contributing to the reduction in NMI.

Table 6 displays the contingency table of Alg1. Table 7 presents the contingency table of Alg 2.

Table 8. NMI values based on seven well-known community detection algorithms applied to the email-Eu-core network dataset.